[Numpy] Fix AWS Batch + Add Docker Support #1302

sxjscience · 2020-08-18T09:47:57Z

Fix batch. We will need to get the name of the log stream from describeJobsResponse.
Add docker support. Currently, by launching the docker, it will launch a JupyterLab development environment with GluonNLP installed. The docker image has been uploaded to docker hub: https://hub.docker.com/repository/docker/gluonai/gluon-nlp

docker pull gluonai/gluon-nlp:v1.0.0
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 gluonai/gluon-nlp:v1.0.0

@dmlc/gluon-nlp-committers

Should solve #1243 and #1139

codecov · 2020-08-18T10:37:53Z

Codecov Report

Merging #1302 into master will increase coverage by 0.15%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1302      +/-   ##
==========================================
+ Coverage   84.14%   84.30%   +0.15%     
==========================================
  Files          42       42              
  Lines        6397     6397              
==========================================
+ Hits         5383     5393      +10     
+ Misses       1014     1004      -10

Impacted Files	Coverage Δ
src/gluonnlp/__init__.py	`100.00% <100.00%> (ø)`
src/gluonnlp/utils/misc.py	`50.63% <0.00%> (-1.27%)`	⬇️
src/gluonnlp/data/loading.py	`83.39% <0.00%> (+5.28%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 32e87d4...dbc34be. Read the comment docs.

szha · 2020-08-18T16:53:18Z

LICENSE

+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright [yyyy] [name of copyright owner]


let me get back to you what we update this to.

leezu · 2020-08-18T17:30:17Z

tools/docker/ubuntu18.04-devel-gpu.Dockerfile

+RUN wget -c https://www.openssl.org/source/openssl-${OPENSSL_VERSION}.tar.gz \
+ && apt-get update \
+ && apt remove -y --purge openssl \
+ && rm -rf /usr/include/openssl \
+ && apt-get install -y \
+    ca-certificates \
+ && tar -xzvf openssl-${OPENSSL_VERSION}.tar.gz \
+ && cd openssl-${OPENSSL_VERSION} \
+ && ./config && make -j $(nproc) && make test \
+ && make install \
+ && ldconfig \
+ && cd .. && rm -rf openssl-*


Do you mean why should we install openssl?

Why compile openssl from source?

I guess it's because we may want to have a customized version of openssl. Also, there might be security issues so that we may not install from another resource. That's the solution adopted in DLC: https://github.com/aws/deep-learning-containers/blob/95e4d9c9cba8b6dffec61637452b4bbd46bb59bd/mxnet/training/docker/1.6.0/py3/cu101/Dockerfile.gpu#L113-L124

Ubuntu manages the security fixes for you. No need to compile from source. I recommend you remove this part.

tools/docker/ubuntu18.04-devel-gpu.Dockerfile

barry-jin · 2020-08-18T17:44:02Z

tools/batch/submit-job.py

@@ -147,10 +148,10 @@ def main():
            sys.exit(status == 'FAILED')

        elif status == 'RUNNING':
-            logStreamName = getLogStream(logGroupName, jobName, jobId)
+            logStreamName = describeJobsResponse['jobs'][0]['container']['logStreamName']


Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile

sxjscience · 2020-08-19T09:28:09Z

@leezu I tried to compile with the latest MXNet and install horovod via Haibin's branch. However, I'm seeing this error message:

/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/crti.o: In function `_init':
(.init+0x7): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/print_graph_ir.cc.o: In function `std::_Function_base::_Base_manager<nnvm::pass::PrintGraphIR_(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::ostream&)::{lambda(unsigned int, std::ostream&)#2}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<nnvm::pass::PrintGraphIR_(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::ostream&)::{lambda(unsigned int, std::ostream&)#2}> const&, std::_Manager_operation)':
print_graph_ir.cc:(.text+0x3bb): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/print_graph_ir.cc.o: In function `std::_Function_base::_Base_manager<nnvm::pass::PrintGraphIR_(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::ostream&)::{lambda(unsigned int, std::ostream&)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<nnvm::pass::PrintGraphIR_(nnvm::Graph, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::ostream&)::{lambda(unsigned int, std::ostream&)#1}> const&, std::_Manager_operation)':
print_graph_ir.cc:(.text+0x59b): relocation truncated to fit: R_X86_64_PC32 against `.data.rel.ro'
CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/print_graph_ir.cc.o: In function `nnvm::pass::GetVectorPrinter(nnvm::Graph const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)':
print_graph_ir.cc:(.text+0x6d8): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `typeinfo name for std::vector<nnvm::TShape, std::allocator<nnvm::TShape> >' defined in .rodata._ZTSSt6vectorIN4nnvm6TShapeESaIS1_EE[_ZTSSt6vectorIN4nnvm6TShapeESaIS1_EE] section in CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/infer_shape_type.cc.o
print_graph_ir.cc:(.text+0x708): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `typeinfo name for std::vector<int, std::allocator<int> >' defined in .rodata._ZTSSt6vectorIiSaIiEE[_ZTSSt6vectorIiSaIiEE] section in CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/infer_shape_type.cc.o
print_graph_ir.cc:(.text+0x72b): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `typeinfo name for std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >' defined in .rodata._ZTSSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS5_EE[_ZTSSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS5_EE] section in CMakeFiles/nnvm.dir/3rdparty/tvm/nnvm/src/pass/print_graph_ir.cc.o
print_graph_ir.cc:(.text+0x75b): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_ios<char, std::char_traits<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/7/libstdc++.so
print_graph_ir.cc:(.text+0x7a0): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `VTT for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@@GLIBCXX_3.4.21' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/7/libstdc++.so
print_graph_ir.cc:(.text+0x7c5): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >@@GLIBCXX_3.4.21' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/7/libstdc++.so
print_graph_ir.cc:(.text+0x7e3): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `vtable for std::basic_streambuf<char, std::char_traits<char> >@@GLIBCXX_3.4' defined in .data.rel.ro section in /usr/lib/gcc/x86_64-linux-gnu/7/libstdc++.so
print_graph_ir.cc:(.text+0x816): additional relocation overflows omitted from the output

Thus, I reverted to use the mxnet wheel package instead and commented out the codes related to horovod. Would you approve it if you feel that it's appropriate?

leezu · 2020-08-19T17:48:06Z

CODE_OF_CONDUCT.md

+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team in GitHub issues/pull requests
+by mentioning @dmlc/gluon-nlp-committers. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.


Recommending to open a Github issue may not meet the confidentiality you promise here.

Should we create an issue and revise it in a later PR? Or I may remove this CODE_OF_CONDUCT for now.

leezu · 2020-08-19T20:38:20Z

tools/docker/ubuntu18.04-devel-gpu.Dockerfile

+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
+# Install CMake 3.13.3. The default in Ubuntu 18.04 is cmake 3.10.2


pip install cmake will be easier ;)

sxjscience · 2020-08-20T09:05:46Z

Horovod should have been added to the dockerfile.

zheyuye

I can confirm that docker image can be built and has been uploaded to Docker hub, and all unittests in Horovod related to Mxnet has been passed.

commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]>

commit 7525618 Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:25:38 2020 +0800 Squashed commit of the following: commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (dmlc#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (dmlc#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]> commit 1403c6e Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:15:44 2020 +0800 update uncased_bert_large commit 733a4b6 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 20:16:39 2020 +0800 adjust uncased_bert_large commit 770f079 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 15:10:57 2020 +0800 Revert "merge xingjian's" This reverts commit ea1f1aa. commit fe74dda Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:07:36 2020 +0800 update electra small commit 8972343 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:00:57 2020 +0800 add command to readme commit 8fcde49 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:30:47 2020 +0800 revise commit 7a625c4 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:21:58 2020 +0800 update reamde commit 071c6dd Author: ZheyuYe <[email protected]> Date: Wed Aug 19 17:14:53 2020 +0800 update bert squad command commit ea1f1aa Author: ZheyuYe <[email protected]> Date: Tue Aug 18 18:07:01 2020 +0800 merge xingjian's commit 859ab4d Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:47:01 2020 +0800 dummy example commit 633e683 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:36:31 2020 +0800 list_backbone_names commit b4aac59 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:32:51 2020 +0800 update readme commit 54301d9 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:59:06 2020 +0800 revise batch squad commit e019e27 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:58:49 2020 +0800 bash convert commit e01eda0 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 11:10:51 2020 +0800 update roberta commit 1730ff7 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 10:15:27 2020 +0800 revise submit commit de0b4c9 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:07:58 2020 +0800 upload batch files commit 175de01 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:05:02 2020 +0800 fix commit 0460ed3 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 15:48:52 2020 +0800 upload commands

* Squashed commit of the following: commit 7525618 Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:25:38 2020 +0800 Squashed commit of the following: commit d8b68c6 Author: Xingjian Shi <[email protected]> Date: Thu Aug 20 08:47:56 2020 -0700 [Numpy] Fix AWS Batch + Add Docker Support (#1302) * Update submit-job.py Add LICESE + Examples for batch Update docker image update Update README.md Update README.md Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile Update ubuntu18.04-devel.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile update update Update submit-job.py Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile try to fix fix batch Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update submit-job.py Update ubuntu18.04-devel-gpu.Dockerfile simplify bert test add files Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile Update ubuntu18.04-devel-gpu.Dockerfile fix Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * try to add back mxnet support * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * update * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * Update ubuntu18.04-devel-gpu.Dockerfile * fix issues * update commit 6ae558e Author: ht <[email protected]> Date: Thu Aug 20 23:47:30 2020 +0800 [FEATURE]Horovod support for training transformer (PART 2) (#1301) * set default shuffle=True for boundedbudgetsampler * fix * fix log condition * use horovod to train transformer * fix * add mirror wmt dataset * fix * rename wmt.txt to wmt.json and remove part of urls * fix * tuning params * use get_repo_url() * update average checkpoint cli * paste result of transformer large * fix * fix logging in train_transformer * fix * fix * fix * add transformer base config * fix * change to wmt14/full * print more sacrebleu info * fix * add test for num_parts and update behavior of boundedbudgetsampler with even_size * fix * fix * fix * fix logging when using horovd * udpate doc of train transformer * add test case for fail downloading * add a ShardedIterator * fix * fix * fix * change mpirun to horovodrun * make the horovod command complete * use print(sampler) to cover the codes of __repr__ func * empty commit * add test case test_sharded_iterator_even_size Co-authored-by: Hu <[email protected]> commit 1403c6e Author: ZheyuYe <[email protected]> Date: Fri Aug 21 11:15:44 2020 +0800 update uncased_bert_large commit 733a4b6 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 20:16:39 2020 +0800 adjust uncased_bert_large commit 770f079 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 15:10:57 2020 +0800 Revert "merge xingjian's" This reverts commit ea1f1aa. commit fe74dda Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:07:36 2020 +0800 update electra small commit 8972343 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 14:00:57 2020 +0800 add command to readme commit 8fcde49 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:30:47 2020 +0800 revise commit 7a625c4 Author: ZheyuYe <[email protected]> Date: Thu Aug 20 12:21:58 2020 +0800 update reamde commit 071c6dd Author: ZheyuYe <[email protected]> Date: Wed Aug 19 17:14:53 2020 +0800 update bert squad command commit ea1f1aa Author: ZheyuYe <[email protected]> Date: Tue Aug 18 18:07:01 2020 +0800 merge xingjian's commit 859ab4d Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:47:01 2020 +0800 dummy example commit 633e683 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:36:31 2020 +0800 list_backbone_names commit b4aac59 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 17:32:51 2020 +0800 update readme commit 54301d9 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:59:06 2020 +0800 revise batch squad commit e019e27 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 13:58:49 2020 +0800 bash convert commit e01eda0 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 11:10:51 2020 +0800 update roberta commit 1730ff7 Author: ZheyuYe <[email protected]> Date: Tue Aug 18 10:15:27 2020 +0800 revise submit commit de0b4c9 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:07:58 2020 +0800 upload batch files commit 175de01 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 16:05:02 2020 +0800 fix commit 0460ed3 Author: ZheyuYe <[email protected]> Date: Mon Aug 17 15:48:52 2020 +0800 upload commands * add mobilebert * replace remote * fix branch * fix typo Co-authored-by: Yuma1L <[email protected]>

sxjscience changed the title ~~[Numpy] Fix Batch + Add Docker Support~~ [Numpy] Fix AWS Batch + Add Docker Support Aug 18, 2020

szha reviewed Aug 18, 2020

View reviewed changes

leezu reviewed Aug 18, 2020

View reviewed changes

tools/docker/ubuntu18.04-devel-gpu.Dockerfile Outdated Show resolved Hide resolved

barry-jin reviewed Aug 18, 2020

View reviewed changes

sxjscience force-pushed the fix_batch branch from f760b34 to c4d8b4c Compare August 19, 2020 09:24

leezu reviewed Aug 19, 2020

View reviewed changes

sxjscience added 8 commits August 19, 2020 13:38

Update ubuntu18.04-devel-gpu.Dockerfile

cb47d14

try to add back mxnet support

2d80a06

Update ubuntu18.04-devel-gpu.Dockerfile

456eccc

Update ubuntu18.04-devel-gpu.Dockerfile

6d612e2

update

409cc67

Update ubuntu18.04-devel-gpu.Dockerfile

62e1ccb

Update ubuntu18.04-devel-gpu.Dockerfile

29fa8eb

Update ubuntu18.04-devel-gpu.Dockerfile

cc89a33

zheyuye mentioned this pull request Aug 20, 2020

Add Intro for batch + upload squad traininng command #1305

Merged

4 tasks

fix issues

c0a8912

zheyuye approved these changes Aug 20, 2020

View reviewed changes

update

dbc34be

sxjscience merged commit d8b68c6 into dmlc:master Aug 20, 2020

This was referenced Aug 20, 2020

Docker container for GluonNLP #1139

Closed

[Numpy Refactor] [Installation] Add Official Dockerfile support #1243

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Numpy] Fix AWS Batch + Add Docker Support #1302

[Numpy] Fix AWS Batch + Add Docker Support #1302

sxjscience commented Aug 18, 2020 •

edited

Loading

codecov bot commented Aug 18, 2020 •

edited

Loading

szha Aug 18, 2020

leezu Aug 18, 2020

sxjscience Aug 18, 2020

leezu Aug 18, 2020

sxjscience Aug 18, 2020

leezu Aug 18, 2020

barry-jin Aug 18, 2020

sxjscience commented Aug 19, 2020

leezu Aug 19, 2020

sxjscience Aug 19, 2020

leezu Aug 19, 2020

sxjscience commented Aug 20, 2020

zheyuye left a comment

[Numpy] Fix AWS Batch + Add Docker Support #1302

[Numpy] Fix AWS Batch + Add Docker Support #1302

Conversation

sxjscience commented Aug 18, 2020 • edited Loading

codecov bot commented Aug 18, 2020 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sxjscience commented Aug 19, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sxjscience commented Aug 20, 2020

zheyuye left a comment

Choose a reason for hiding this comment

sxjscience commented Aug 18, 2020 •

edited

Loading

codecov bot commented Aug 18, 2020 •

edited

Loading